Background: Low-cost DNA sequencing allows organizations to accumulate massive amounts of genomic data and use that data to answer a diverse range of research questions. Presently, users must search for relevant genomic data using a keyword, accession number, or meta-data tag. However, in this search paradigm the form of the query (a text-based string) is mismatched with the form of the target (a genomic profile).

Results: To improve access to massive genomic data resources, we have developed a fast search engine, GEMINI, that uses a genomic profile as a query to search for similar genomic profiles. GEMINI implements a nearest-neighbor search algorithm using a vantage-point tree to store a database of n profiles and, in certain circumstances, achieves an O(log n) expected query time in the limit. We tested GEMINI on breast and ovarian cancer gene expression data from The Cancer Genome Atlas project and show that, in practice, its query time on genomic data scales as the logarithm of the number of records. In a database with 10^5 samples, GEMINI identifies the nearest neighbor in 0.05 sec compared to a brute-force search time of 0.6 sec.

Conclusions: GEMINI is a fast search engine that uses a query genomic profile to search for similar profiles in a very large genomic database. It enables users to identify similar profiles independent of sample label, data origin, or other meta-data information.
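To make the nearest-neighbor search named in the Results concrete, the following is a minimal Python sketch of a generic vantage-point tree, not GEMINI's actual implementation. The abstract does not state the distance metric or how vantage points are chosen, so the Euclidean metric, the random vantage-point selection, and all names here (`euclidean`, `VPNode`, `build_vp_tree`, `nearest`) are illustrative assumptions.

```python
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


class VPNode:
    """One tree node: a vantage point, a split radius, and two subtrees."""

    def __init__(self, point, radius, inside, outside):
        self.point = point      # the vantage point stored at this node
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # subtree of points at distance <= radius
        self.outside = outside  # subtree of points at distance >= radius


def build_vp_tree(points, dist=euclidean, rng=None):
    """Recursively build a tree, splitting at the median distance."""
    rng = rng or random.Random(0)  # fixed seed only for reproducibility
    points = list(points)
    if not points:
        return None
    # Assumption: the vantage point is picked uniformly at random.
    vp = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vp, 0.0, None, None)
    ranked = sorted(((dist(vp, p), p) for p in points), key=lambda t: t[0])
    mid = len(ranked) // 2
    radius = ranked[mid][0]
    inside = [p for _, p in ranked[:mid]]
    outside = [p for _, p in ranked[mid:]]
    return VPNode(vp, radius,
                  build_vp_tree(inside, dist, rng),
                  build_vp_tree(outside, dist, rng))


def nearest(node, query, dist=euclidean, best=None):
    """Return (distance, point) of the stored point closest to query."""
    if node is None:
        return best
    d = dist(query, node.point)
    if best is None or d < best[0]:
        best = (d, node.point)
    # Search the side containing the query first, then visit the other
    # side only if the triangle inequality cannot rule it out.
    if d <= node.radius:
        best = nearest(node.inside, query, dist, best)
        if d + best[0] >= node.radius:
            best = nearest(node.outside, query, dist, best)
    else:
        best = nearest(node.outside, query, dist, best)
        if d - best[0] <= node.radius:
            best = nearest(node.inside, query, dist, best)
    return best


if __name__ == "__main__":
    gen = random.Random(42)
    # Synthetic stand-ins for expression profiles (not real data).
    profiles = [tuple(gen.random() for _ in range(10)) for _ in range(1000)]
    tree = build_vp_tree(profiles)
    query = tuple(gen.random() for _ in range(10))
    print(nearest(tree, query))
```

Pruning relies on the triangle inequality, so the sketch is only valid when the chosen distance is a true metric. The logarithmic expected query time holds only when most subtrees can be skipped; in the worst case the search still visits every node.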